-
Notifications
You must be signed in to change notification settings - Fork 725
[Executorch] make slice_copy parallel #15830
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: gh/kimishpatel/214/base
Are you sure you want to change the base?
Conversation
When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/) [ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15830
Note: Links to docs will display an error until the docs builds have been completed. ❌ 123 New FailuresAs of commit 6f77646 with merge base 7600df8 ( NEW FAILURES - The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
When doing large prefills in LLMs, slice_copy takes about 5-10% time. Mainly coming from slicing in the rope implementation. Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/) ghstack-source-id: 323355686 Pull Request resolved: #15830
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This pull request adds parallel processing to the slice_copy operation in ExecutorTorch to improve performance during large prefills in LLMs, where slice_copy can take 5-10% of execution time (primarily from rope implementation slicing).
Key Changes:
- Added multithreading support to
compute_slicefunction with workload-based thresholds - Parallel execution distributes work across leading dimensions using
parallel_for - Single-threaded fallback maintained for smaller workloads
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| kernels/portable/cpu/util/targets.bzl | Adds threadpool dependency required for parallel execution support |
| kernels/portable/cpu/util/slice_util.cpp | Implements parallel slice_copy with multithreading when leading_dims ≥ 8 and total_elements ≥ 32768, maintaining single-threaded fallback for smaller workloads |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (use_multithreading) { | ||
| // Use parallel_for to distribute work across leading dimensions | ||
| // Calculate grain size based on number of elements per leading dimension | ||
| const int64_t elements_per_leading_dim = length * trailing_dims; |
Copilot
AI
Nov 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable elements_per_leading_dim is calculated but never used. It appears this was intended for grain size calculation but MIN_LEADING_DIMS_FOR_MT is used instead. Consider removing this unused variable or using it to calculate a more dynamic grain size based on workload characteristics.
| const int64_t elements_per_leading_dim = length * trailing_dims; |
Stack from ghstack (oldest at bottom):
When doing large prefills in LLMs, slice_copy takes about 5-10% time.
Mainly coming from slicing in the rope implementation.
Differential Revision: D85532081